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Summary A new method to simulate probability distri- 
butions in regions where the events are very unlikely (e.g. 
p ~ 10 °) is presented. The basic idea is to represent the 
underlying probability space by the phase space of a physical 
system. The system is held at a temperature T, which is cho- 
sen such that the system preferably generates configurations 
which originally have low probabilities. Since the distribution 
of such a physical system is know from statistical physics, the 
original unbiased distribution can be obtained. 

As an application, local alignment of protein sequences 
based on BLOSUM62 substitution scores with (12, 1) affine 
gap costs are considered. The distribution of optimum 
sequence- alignment scores S is studied numerically over a 
large range of scores. The deviation of p(S) from the extreme- 
value (or Gumbel) distribution is quantified. This deviation 
decreases with growing sequence length. 

Introduction Modern molecular biology, e.g. the hu- 
man genome project JjJ , relies heavily on the use of large 
databases (|||, where DNA or protein sequences are 
stored. The basic tool for accessing these databases and 
comparing different sequences is sequence alignment. The 
result of each comparison is an maximum alignment score 
S. One is interested either in global or local optimum 
alignments. For the first case, the score is maximized 
over all alignments which include both sequences fully, 
while for the second case, the optimum is calculated over 
all possible subsequences. To estimate the significance 
of the result of a comparison, one has to know, based 
on a random model, the statistical distribution p(S) of 
scores. For biologically relevant models, e.g. for protein 
sequences with BLOSUM62 substitution scores Q and 
affine gap costs |^], p(S) is not known in the interesting 
region, where p(S) is small. A number of empirical stud- 
ies P-fl0| for local alignment, in the region where p(S) is 
large, suggest that p(S) is an extreme value (or Gumbel) 
distribution Ol 



PG (S) = Ae- A ( s -") exp(-e- A ( s - u )) 



(1) 



where u denotes the maximum of the distribution and A 
characterizes the behavior for large values of S, i.e. the 
tail of the distribution. 

In this work, to determine the tail of p(S), a rare 
event simulation is applied. For dynamical problems, 
like investigating queuing systems or studying the reli- 
ability of technical components, several techniques Jl2| , 
mostly based on importance sampling have been devel- 
oped. Related methods have been introduced in physics, 
like multi- canonical sampling fill or the pruned- enriched 
Rosenbluth method ImI. 



By simply changing perspective, one can apply these 
standard techniques to many other problems. Here, the 
method it is applied to the sequence alignment problem. 
Instead of directly drawing the random sequences, the 
basic idea is to use a physical system, which has a state 
given by a pair of sequences and is held at temperature T. 
This is similar to the simulated annealing approach |l5[ , 
used to solve hard optimizations problems. The state 
of the system changes in time, governed by the rules of 
statistical mechanics. The energy E of the system is 
defined as E — —S. Therefore, at low temperatures the 
system prefers a state with high score value S. Since the 
thermodynamic properties of the system are known, it is 
possible to extract from the measured distribution p* (S) 
of scores the target distribution p(S). 

For the alignment problem two sequences x = 
X1X2 ■ ■ ■ x n and y = yiyz ■■ - Vm over a finite alphabet 
with r letters are given. For DNA the alphabet has 
4 letters, representing the bases, for protein sequences 
it has 20 letters, representing the amino acids. Let f 
be the probability for the occurrence of letter i, assum- 
ing here that all letters of a sequence are independent. 
An alignment is a pairing {{xi k , yj k )} (fc = 1,2, . . . ,K, 
I < ik < ifc+i < n and 1 < jk < jk+i < m ) of let- 
ters from the two sequences. Please note that some let- 
ters may not be aligned, i.e. gaps occur. To each align- 
ment a score is assigned, via a scoring function s(x,y). 
The total score is the sum of scores of all aligned letters 
J2k s i x ikTVjk) P ms the costs of all gaps. Here, so called 
affine gap costs (a, 0) are considered, i.e. a gap of length 
I has costs g(l) — —a — (3(1 — 1). For both global and 
local alignment efficient algorithms 16 exist, which 



calculate an optimum alignment in time 0(nm). Hence, 
one can easily generate e.g. N rs 10 5 samples of pair 
of sequences according the frequencies obtain each 
time the optimum alignment, and calculate a histogram 
of the scores S. This simple sampling allows one to cal- 
culate p(S) in the region were the probabilities are large 
(e.g. p(S) ~ 10 -4 ). Recently, the island method Q was 
introduced, which allows a speed up of several orders of 
magnitudes, but still the far end of the distribution is out 
of reach. 

Algorithm The basic idea to determine the behav- 
ior of p(S) at the rare event tail (e.g. p(S) « 10~ 40 ) 
is to view a pair of sequences as a physical system, 
which behaves according the rules of statistical mechan- 
ics. This means that instead of considering many inde- 
pendent pairs of fixed sequences, one pair of sequences 
c(i) = (x(t),y(i)) is used, which changes at discrete 
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times t. Mathematically speaking, a Markov chain |19] 
c(Q) — > c(l) — » c(2) — * ... is used to generate the in- 
stances. The behavior of a Markov chain is not deter- 
ministic, but governed by probabilities p(c — > c ) that 
state c follows immediately after state c. 

The simplest rule for the transition is, to choose ran- 
domly a position in one of the sequences and choose ran- 
domly a new letter from the alphabet, with all positions 
being equiprobable and the letters having probabilities 
fi, i.e. p(c — * c ) = fi/(n+m) if c, c differ by at most one 
letter, and pic — > c ) = otherwise. With this choice of 
the transition probabilities, for t — > 00 all possible pairs 
of sequence have the probability P(c) = Y\ i f Xi JT ■ / Vj of 
occurring. Hence, simple sampling is reproduced. 

To increase the efficiency, one can apply importance 
samplin g [h2 [ |2C| ], a standard method for simulating rare 
events which allows one to concentrate the sam- 

pling in small regions in configuration space. Suppose, 
we want to calculate the expectation value (g) p of a 
function g on a (here discrete) configuration space {c}, 
with respect to the probabilities P(c). Thus, we have 
(g) = E c g(c)P(c). To estimate (g)p, we generate N 
samples which are distributed according to P, and 
calculate Let now Q(c) be another prob- 

ability distribution with Q(c) ^ for all c. Then we can 
write (g) P = £ c 2M^Q( C ) = (gP/Q) Q . This means, 

when we generate N configurations according to the 
probabilities Q(c), then l/NJ2i f{c {l) )P(c {l) )/Q(c (i) ) is 
also an estimator for (f)p. By choosing Q such that 
the sampling occurs in the region of the phase space we 
arc interested in, we can obtain a much higher accuracy 
there. 

A good choice for the transition rule of the sequence- 
alignment problem is to select one position in the pair 
c of sequences randomly, change the letter randomly 
according the frequencies fi, recalculate the optimum 
alignment S(c ) with a standard algorithm and accept 
this move c — > c with probability max(l, exp(AS/T)), 
where AS = S(c ) — 5(c). This transition probabili- 
ties describe a physical system at temperature T with 
energy E = —S. The advantage of this approach is 
that the equilibrium distribution Q(c) is known from sta- 
tistical physics |§: Q(c) = P(c)exp(S(c)/T)/Z with 
Z(T) — J2 C P( c ) ex P(5(c)/T) being a normalization con- 
stant, called the partition function. Thus, the probability 
to have score S is 

expQS/T) ^' 

= z^r) ^ ' ^ 

where the sum ^ runs over all sequences with score 
5. Thus, to obtain p{S), one simulates the system at 
temperature T and records a histogram of scores p*(S). 
Then, the unbiased estimator for the distribution is 

p(S) = £ c 'p(c) = p*(S)Z(T)exp(-S/T). Note that 
Z(T) is unknown a priori, but can be determined very 
easily, as shown below. 



To describe the behavior of p(S) over a wide range, 
the model must be simulated at several temperatures. 
For this reason, and because at low temperatures the dy- 
namics are slow in general so we want to increase the effi- 
ciency, the model is simulated via the parallel tempering 
method (2^j2^]. Using this technique, the system is simu- 
lated at Nt different temperatures To < T\ < . . . < Tn t 
in parallel, i.e. with Nt independent pairs c{T{) of se- 
quences. The main idea of parallel tempering is, that 
from time to time the configurations between neighboring 
temperatures Tj, Tj + \ are exchanged according a proba- 
bilistic rule |2^,|2^]. Here, each simulation step consists 
of one Markov step for each configuration c and one ex- 
change step between one neighboring pair c(Tj), c(T;+i). 

Example Next, a simple example is given, illustrat- 
ing how the method works. Optimum local alignments 
without gaps for sequences of equal length m — n = 20 
and r — 4 letters, all having the same probability 1/4, 
are calculated. For the test the following score is ap- 
plied: s(x, y) = 1 if x = y and s(x, y) = —3 otherwise. 
Two types of runs are performed: (a) initially, all pairs 
of sequences are random, and (b) each pair consists of 
two equal sequences. Thus, for the first type, initially 
the score is low, while for the second type the score is 
initially maximal. This provides a criterion for equilibra- 
tion: if the average score for both types of initial configu- 
ration agree within error bars (at time to), the simulation 
is long enough. In Fig. the average optimum score S 
from 1000 independent runs and four different tempera- 
tures T is shown. 
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FIG. 1. Average alignment score S as a function of step 
t for n, m = 20, 4 letters, local alignment without gaps for 
different temperatures T. For each temperature, 1000 simu- 
lations were started with two random sequences (low scores) 
and 1000 simulations with two equal sequences (high scores). 



To obtain uncorrelated samples, only samples at to, 
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to + t, <o + 2r etc are taken, where r is the characteristic 
time in which the score-score correlation 



cs(to,t) 



(S(t )S(t)) - (Si 
(S 2 ) - (S) 2 



(3) 



decreases to 1/e. 

In Fig. ^ the raw distributions of S for two tempera- 
tures is shown together with a distribution from a simple 
sampling of TV = 10 4 realizations. Clearly, with the sta- 
tistical mechanics approach, the region of high scores is 
sampled much more frequently. 
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FIG. 2. Raw distribution of alignment scores S for the 
direct simulation and for T = 0.57 and T = 0.69 with 
n — m — 20, 4 letters and local alignment without gaps. 



For low scores, the final distributions obtained from 
the simple sampling and from the finite-temperature sim- 
ulation must agree. This can be used to determine the 
constant Z(T). It is chosen such that the difference in 
an interval [S± , S2] between the simple sampling distri- 
bution and the rescaled distribution at T is minimal. In 
the same way the constants at lower temperatures can be 
obtained by matching to distributions obtained before at 
higher temperatures. The final distribution is shown in 
Fig. ||. For each datapoint, the distribution with the 
highest accuracy was taken. For comparison, a simple 
sampling distribution obtained using a huge number of 
samples (N — 10 9 ) is shown. Both results agree very 
well. Please note that the distribution from the finitc- 
T approach spans almost the entire interval [0,20]. In 
principal, the region for very small score S can be in- 
vestigated also using the new method by simulating at 
negative temperature. How powerful the new method is 
can be seen by looking at the right border of the inter- 
val, where a value p(20) = 9.13(20) x 10 -13 was obtained. 
This agrees within error bars with the exact result |2q| 



0.25 20 w 9.09 x 10" 13 . This example illustrates, that the 
method presented here is indeed able to calculate accu- 
rately the distribution p(S) of optimum alignment scores 
in regions where p(S) is very small. 
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FIG. 3. Rescaled distribution p(S) for the direct simulation 
and for T = 0.57, T = 0.69 for n,m = 20, 4 letters, local 
alignment without gaps. The solid line is the result of a large 
simple sampling simulation with N = 10 9 samples. 



Results Next, the results for a biologically relevant 
case are presented. Sequences of amino acids distributed 
according the background frequencies by Robinson and 
Robinson Q are used together with the BLOSUM62 
scoring matrix j| for (12, 1) affine gap costs. This type 
of system has been studied in Ref. |l(| in the region 
where p(S) is large. Here, sequences of length n = m 
in the range [40, 400] were considered. The simulations 
were performed for ht — 7 temperatures T £ [2 ... 10] 
([3.5 ... 10] for n, m — 400), with up to 100 independent 



runs of length up to t n 



4 x 10 5 steps. For the low- 



est temperatures it was not possible to equilibrate the 
system within the given time. The reason is that near 
T « 1/A the equilibration times seem to diverge. This 
indicates a phase transition in the physical system with 
(probably) a glassy phase at low temperatures. Hence, 
for the evaluation the only temperatures used, were those 
where equilibration was possible. 

In Fig. |] the distributions p(S) of optimum alignment 
scores obtained in this way are shown. It was possible 
to determine the distribution in regions where p(S) is 
as small as 10 -40 . To obtain the same accuracy with a 
simple-sampling approach, given a computer which opti- 
mizes say 10 6 samples per second, a total simulation time 
of about 2.5 x 10 17 times the age of the universe would be 
necessary. Also shown in Fig. Q are fits of the low-score 
data to Gumbcl distributions. The resulting parameters 
(A, u) are comparable to the values found |jl(| before and 
depend slightly on the sequence length. For high scores, 
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FIG. 4. Distribution of alignment scores 
S for L = 40, 100, 200, 400, BLOSUM62 substitution matrix 
and affine (12,1) gap costs. The thin solid lines are fits to 
extreme- value distributions with parameters (A,u), yielding 
(A,u) = (0.355(5), 15.35(4)) (n,m = 40), (0.304(2), 21.67(4)) 
(n, m = 100) and (0.280(3), 32.01(3)) for n,m = 400. 



increasing sequence lengths, the distribution indeed ap- 
proaches the Gumbel distribution. Hence, for very long 
sequences, when one is not interested in the deviation 
from the Gumbel form, the island method jlq] is a more 
suitable tool to study sequence alignment. Please note 
that the island method is not a general purpose method 
like the technique presented here, but designed especially 
for local alignments. 

Acknowledgements The author developed the idea for 
this method at the workshop "Statistical Physics of Bi- 
ological Information" at the Institute for Theoretical 
Physics in Santa Barbara during discussions with P. 
Grassberger and E. Marinari. The author would like 
to thank A. P. Young and P. Grassberger for critically 
reading the manuscript and interesting discussions and 
A. P. Young also for various other support. The simula- 
tions were performed on a Beowulf Cluster at the Institut 
fur Theoretische Physik of the Universitat Magdeburg 
(Germany). Financial support was obtained from the 
DFG (Deutsche Forschungsgemeinschaft) under grant Ha 
3169/1-1. 



significant deviations are visible, in contrast to the ear- 
lier predictions. Since the deviations occur at high score 
values, they could not be detected before using conven- 
tional methods. The reason for the deviations is edge 
effects: very long alignments cannot start near the end 
of either of the sequences, so they become even more un- 
likely. The results found here can be fitted very well to 
modified Gumbel distributions of the form 

p G {S) = fcA e - A(s - tl) - A2(s -" )2 exp(-e~ A(s - u) ) , (4) 

with k ~ 1, resulting in values for (A, A2, u) of (0.3277 ± 
0.0003, 8.56 x 10~ 4 ±3 x 10~ 6 , 15.35±0.04) for n,m = 40, 
(0.2783 ± 0.0003, 1.72 x 10~ 4 ± 1 x 10~ 6 , 21.67 ± 0.04) 
for n,m = 100, and (0.2733 ± 0.0004,6.1 x 10~ 5 ± 2 x 
10" 6 , 32.01 ± 0.03) for n,m = 400. Anyway, with in- 
creasing lengths n, to, on a scale of scores proportional 
to u ~ logn, p(S) approaches the Gumbel distribution 
more and more, i.e. linin^oo A2 = 0. 

Summary A method has been presented which allows 
one to study rare events in random systems down to re- 
gions of very low probabilities. The basic idea is to repre- 
sent the probability space by the phase space of a physical 
system held at temperature T. From the distribution of 
states, the original unbiased distribution can be obtained. 

Here, the method is applied to the local sequence- 
alignment problem. A biologically relevant case is 
treated. The distribution of optimum alignment scores 
can be studied in regions where the probability is as 
small as 10~ 40 , and yet the deviations of the distribution 
from the theoretical prediction are visible. However, with 
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